Abstract

We apply a family-based extension of the sequence kernel association test (SKAT) to 93 trios extracted from the 20 
pedigrees in the Genetic Analysis Workshop 18 simulated data. Each extracted trio includes a unique set of parents 
to ensure conditionally independent trios are sampled. We compare the empirical type I error and power between 
the family-based SKAT and the burden test under varying percentages of causal single-nucleotide polymorphisms 
included in the analysis. Our investigation using simulated data suggests that, under the setting used for Genetic 
Analysis Workshop 18 data, both the family-based SKAT and the burden test have limited power, and that there is 
no substantial impact of percentage of signal on the power of either test. The low power is partially a result of the 
small sample size. However, we find that both the family-based SKAT and the burden test are more powerful when 
we use only rare variants, rather than common variants, to test the association. 



Background 

Genome-wide association studies (GWAS) have proven 
to be a powerful approach to identify novel common sin- 
gle-nucleotide polymorphisms (SNPs) contributing to the 
etiology of complex traits [1]. However, identifying rare 
genetic variants with minor allele frequency (MAF) <5% 
that are associated with complex diseases remains chal- 
lenging. Standard statistical tests for common variants 
(MAF >5%) are underpowered for rare variants because 
of their low frequencies and moderate effect sizes. Even 
with appropriate methods, large sample sizes are 
required to observe enough copies of the rare alleles [2,3]. 

A major limitation of population-based association 
analyses is the potential for unrecognized population het- 
erogeneity as a result of population stratification. This 
problem, however, can be well addressed through the use 
of family-based studies, which use related individuals in 
association studies. Family-based controls eliminate the 
need to adjust for population structure [4,5]. Other 
advantages of family-based controls include the ability to 
identify and correct technological artifacts in the data and 
to investigate questions, such as parent-of-origin effects, 
that are imperfectly or not readily addressed in 
case-control association studies [4,5]. 

The data set for the Genetic Analysis Workshop 18 
(GAW18) consists of whole genome sequence data from 
a pedigree-based sample. These pedigrees are drawn 
from the Type 2 Diabetes Genetic Exploration by Next- 
generation sequencing in Ethnic Sample Project 2 
(T2D-GENES Project 2). The T2D-GENES Project 2 is 
designed to identify low-frequency or rare variants influ- 
encing susceptibility to type 2 diabetes using information 
from whole genome sequencing of 1043 individuals from 
20 Mexican American pedigrees enriched for type 2 dia- 
betes from San Antonio, Texas. The pedigree data are 
drawn from 2 San Antonio-based family studies: the San 
Antonio Family Heart Study (SAFHS) and the San Anto- 
nio Family Diabetes/Gallbladder study (SAFDGS). 

A variance component test, known as the sequence kernel 
association test (SKAT), has been proposed for testing 
associations of rare variants in population-based designs [2,6]. 
SKAT has been shown to be powerful when rare variants have 
effects in different directions, and it is computationally 
efficient because of the simple limiting distribution of the 
test statistic. However, SKAT is designed for testing 
associations in unrelated subjects and cannot be directly 
applied to family-based designs. An extension of SKAT to 
family-based designs has been proposed [7], hereafter 
referred to as family-based SKAT. In this report, we apply it 
to the GAW18 simulated data and explore further properties 
of the test statistic. 

Methods 

The data set for GAW18 includes 959 individuals out of 
1043 individuals from 20 Mexican American pedigrees of 
the T2D-GENES Project 2. We conducted our analysis 
using the simulated phenotypes, baseline systolic blood 
pressure (SBP) and diastolic blood pressure (DBP), and 
the whole genome sequenced and imputed genotypes of 
the 959 correlated individuals. To keep the notation sim- 
ple and make our discussion transparent, we considered 
trio designs, acknowledging the fact that the method is 
applicable to more general family structures. 

Trio selection 

Conditionally independent trios were extracted from the 
20 extended pedigrees. Each extracted trio included a 
unique set of parents. For nuclear families with more than 
1 offspring, we randomly selected 1 offspring and formed 
a trio with the parents. Specifically, the individuals were 
grouped into families by the parents' identifications, and 1 
offspring was selected with equal probability to form a 
trio. We only selected trios that had complete genotype 
data for all 3 family members. Finally, 93 conditionally 
independent trios were extracted from the GAW18 data. 
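
The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the function name `select_trios`, the tuple layout of the pedigree records, and the choice to sample only among offspring with complete genotype data are all assumptions made for the example.

```python
import random
from collections import defaultdict

def select_trios(pedigree, genotyped, seed=0):
    """Pick one offspring per unique parent pair (hypothetical helper).

    pedigree  : iterable of (person_id, father_id, mother_id); founders use None.
    genotyped : set of person_ids with complete genotype data.
    Returns a list of (offspring, father, mother) tuples.
    """
    rng = random.Random(seed)
    by_parents = defaultdict(list)
    for person, father, mother in pedigree:
        if father is None or mother is None:
            continue  # founders cannot be the offspring of a trio
        by_parents[(father, mother)].append(person)

    trios = []
    for (father, mother), children in sorted(by_parents.items()):
        # keep only trios in which both parents have genotype data
        if father not in genotyped or mother not in genotyped:
            continue
        # assumption: sample among offspring that have genotype data
        eligible = sorted(c for c in children if c in genotyped)
        if eligible:
            trios.append((rng.choice(eligible), father, mother))
    return trios

# Toy pedigree: C1 and C2 are siblings; C1 is also a parent of G1.
toy_pedigree = [
    ("F1", None, None), ("M1", None, None), ("S1", None, None),
    ("C1", "F1", "M1"), ("C2", "F1", "M1"), ("G1", "C1", "S1"),
]
toy_genotyped = {"F1", "M1", "S1", "C1", "C2", "G1"}
toy_trios = select_trios(toy_pedigree, toy_genotyped)
```

In the toy pedigree, one trio is formed for the parent pair (F1, M1) and one for (C1, S1), so every trio has a distinct set of parents even though C1 may appear once as an offspring and once as a parent.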

SKAT and burden test for family-based design 

The family-based SKAT proposed recently [7] can be described as follows. 
For the $i$th trio, denote the specific region of the genome by $G$, the 
offspring trait by $Y_i$, and the offspring genotype at the $j$th variant 
in $G$ by $X_{ij}$ ($1 \le j \le m$), where $m$ is the number of variants 
in the region $G$. We assume a generalized linear mixed effects model 
(GLMM) as follows: 

$$h(\mu_i) = C_i\alpha + X_i\beta,$$

where $\mu_i = E(Y_i)$, $h(\cdot)$ is a known link function, $\alpha$ is 
the vector of regression coefficients for the potential confounders $C_i$, 
and $\beta$ is the vector of regression coefficients for the $m$ variants 
$X_i$. It is further assumed that the coefficients $\beta_j$ are 
independent random variables that follow an unspecified distribution with 
mean 0 and variance $w_j^2\tau$. Here $w_j$ can be considered as a weight 
that can be a function of the data (such as genotype frequencies estimated 
from the parents) or externally defined (such as a functional prediction 
score). Under the GLMM assumption, testing the null hypothesis of no 
genetic effect, that is, all $\beta_j$ equal to 0, is equivalent to 
testing $H_0: \tau = 0$, that is, nonexistence of the variance component 
in the GLMM. Similar to SKAT, the score test for a family-based design is 

$$Q_s = (Y - \hat{\mu}_0)^{\top} K (Y - \hat{\mu}_0),$$

where $\hat{\mu}_0 = C\hat{\alpha}$ for continuous traits, 
$\hat{\mu}_0 = \mathrm{logit}^{-1}(C\hat{\alpha})$ for dichotomous traits, and 

$$K = [X - E(X \mid X_P)]\, W W\, [X - E(X \mid X_P)]^{\top}$$

is a weighted linear kernel. For the kernel, $X$ represents the offspring 
genotype matrix, $X_P$ represents the parental genotype matrix, and 
$W = \mathrm{diag}(w_1, \ldots, w_m)$ represents variant weights based on 
parental genotypes. In this study, we define $w_j = \mathrm{Beta}(f_j; a, b)$, 
where $f_j$ is the estimated variant frequency based on parental genotypes. 
Under the null hypothesis, $E(X \mid X_P)$ can be calculated using the laws 
of Mendelian transmission. For the linear kernel, $Q_s$ has a simple 
expression: 

$$Q_s = \sum_{j=1}^{m} w_j^2 \left[ \sum_{i=1}^{n} (Y_i - \hat{\mu}_{i0}) \{ X_{ij} - E(X_{ij} \mid X^P_{ij}) \} \right]^2,$$

where $n$ is the number of trios and $X^P_{ij}$ is the parental genotype 
data for family $i$ at variant $j$. It can be shown that the test statistic 
$Q_s$ has a limiting distribution of a mixture of chi-square distributions. 
Specifically, $Q_s$ converges weakly to $\sum_{j=1}^{m} \lambda_j \chi^2_{1,j}$, 
where $(\lambda_1, \ldots, \lambda_m)$ are the eigenvalues of the matrix 
$\Lambda^{1/2} L^{\top} W W L \Lambda^{1/2}$ with 
$L \Lambda L^{\top} = \mathrm{Cov}([X - E(X \mid X_P)]^{\top} (Y - \hat{\mu}_0) \mid X_P, Y)$. 
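
As an illustration of the linear-kernel statistic, the following Python sketch computes the weighted sum of squared per-variant scores for a set of trios. It is a sketch under stated assumptions rather than the implementation of [7]: genotypes are coded as minor-allele counts (0, 1, 2) at autosomal variants, so the Mendelian expectation of the offspring genotype is the average of the two parental counts; the Beta parameters a = 1 and b = 25 are illustrative defaults not given in the text; variants that are monomorphic in the parents are skipped; and the function names are hypothetical.

```python
import math

def beta_pdf(f, a, b):
    """Beta(a, b) density at f, used here as the variant weight w_j."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(f) + (b - 1) * math.log(1 - f))

def family_skat_q(y, mu0, child, father, mother, a=1.0, b=25.0):
    """Sum over variants of w_j^2 * S_j^2, where
    S_j = sum_i (y_i - mu0_i) * (X_ij - E(X_ij | parents)).

    child, father, mother: n x m lists of minor-allele counts (0, 1, 2).
    a, b: Beta weight parameters (assumed values, not taken from the text).
    """
    n, m = len(y), len(child[0])
    q = 0.0
    for j in range(m):
        # allele frequency estimated from the 2n parents (4n chromosomes)
        f = sum(father[i][j] + mother[i][j] for i in range(n)) / (4.0 * n)
        if f <= 0.0 or f >= 1.0:
            continue  # assumption: skip variants monomorphic in the parents
        w = beta_pdf(f, a, b)
        # Mendelian expectation: each parent transmits half of its allele count
        s = sum(
            (y[i] - mu0[i]) * (child[i][j] - 0.5 * (father[i][j] + mother[i][j]))
            for i in range(n)
        )
        q += (w * s) ** 2
    return q

# Two trios, one variant: the minor allele is transmitted to the offspring
# with the higher trait value and not transmitted to the other offspring.
q_toy = family_skat_q(
    y=[1.0, -1.0], mu0=[0.0, 0.0],
    child=[[1], [0]], father=[[1], [0]], mother=[[0], [1]],
)
```

The sketch stops at the statistic; obtaining a p value still requires the mixture-of-chi-squares distribution described above.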
Originally, the family-based SKAT assumes that all $\beta$ coefficients 
are independently distributed. To allow for possible correlation of 
effects among different variants, a family of kernels was proposed [2]: 

$$K_\rho = [X - E(X \mid X_P)]\, W R_\rho W\, [X - E(X \mid X_P)]^{\top},$$

where $R_\rho = (1 - \rho) I + \rho \mathbf{1}\mathbf{1}^{\top}$ specifies an 
exchangeable correlation matrix. The test statistic is 
$Q_\rho = (Y - \hat{\mu}_0)^{\top} K_\rho (Y - \hat{\mu}_0)$. 
When $\rho = 0$, $Q_\rho$ equals $Q_s$, where all $\beta$ coefficients are 
assumed independent. When $\rho = 1$, the test statistic becomes 

$$Q_\rho = \left[ \sum_{j=1}^{m} w_j \sum_{i=1}^{n} (Y_i - \hat{\mu}_{i0}) \{ X_{ij} - E(X_{ij} \mid X^P_{ij}) \} \right]^2,$$

which is equivalent to the test statistic in the family-based 
association test (FBAT) [8]. The p value was calculated 
using the moment matching approach [9] or by inverting the 
characteristic function [10], as considered by Lee et al [11]. 
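
The role of the correlation parameter can be seen by expanding the quadratic form: with weighted scores z_j = w_j S_j, the statistic is (1 - ρ) Σ z_j² + ρ (Σ z_j)². The short Python sketch below evaluates this expression; the function name `q_rho` and the toy scores are illustrative assumptions, not values from the GAW18 data.

```python
def q_rho(scores, weights, rho):
    """z^T R_rho z with z_j = w_j * S_j and R_rho = (1 - rho) I + rho 1 1^T."""
    z = [w * s for w, s in zip(weights, scores)]
    return (1.0 - rho) * sum(v * v for v in z) + rho * sum(z) ** 2

# Two variants with equal weights and opposite effect directions.
toy_scores = [2.0, -2.0]
toy_weights = [1.0, 1.0]
q_independent = q_rho(toy_scores, toy_weights, rho=0.0)  # sum of squares
q_burden_like = q_rho(toy_scores, toy_weights, rho=1.0)  # square of the sum
```

With effects in opposite directions the per-variant scores cancel when ρ = 1 but not when ρ = 0, which is the usual reason given for a burden-type statistic losing power in that setting.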

Analysis strategy 

The goal of our analysis was to assess the power of the 
family-based SKAT and the burden test to detect association 
between the simulated quantitative phenotypes (baseline SBP 
and DBP) and the causal genes (from the simulation answer 
sheet) on chromosome 3, with and without adjustment for the 
proportion of causal variants. To ensure a fair comparison of 
power, the empirical type I error rates of all tests were 
evaluated by using the variable Q1 (a quantitative trait in the 
simulation data set, simulated to be not associated with 
any of the SNPs). To evaluate the power of the tests, we 
conducted the family-based SKAT and the burden test 
for each causal gene using baseline SBP and DBP, in turn, 
as the response trait. Each of the 2 tests was 
conducted separately by including rare variants only, 
common variants only, and all variants of each gene. 
Therefore, 12 tests were conducted for each gene. Table 1 
describes the scenarios of the tests. The proportion of 
causal variants among all variants in each gene (referred 
to as strength of signal) may affect the power of the 
tests. To adjust for that proportion, we repeated the 
unadjusted analysis under varying proportions of causal 
SNPs (10%, 25%, and 50%), which gave another 36 analyses 
for comparing the power of the tests. Because the effect 
size differs across causal SNPs, we fixed the causal SNPs 
included in all the scenarios and diluted the strength of 
signal by including differing numbers of noncausal SNPs, 
randomly chosen from each gene, to reach the target 
proportion. For consistency and to prevent a proportion of 
0 signal, we included only causal genes with at least 1 
causal rare variant and at least 1 causal common variant. 
Each analysis was conducted using all 
200 simulated data sets. 
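
The dilution step can be sketched as follows. This is a hypothetical helper, not the original analysis code: the name `dilute_signal`, the rounding rule, and the cap applied when a gene has too few noncausal SNPs are assumptions made for the example.

```python
import random

def dilute_signal(causal, noncausal, proportion, seed=0):
    """Return a SNP set in which `causal` makes up about `proportion` of the SNPs."""
    rng = random.Random(seed)
    # number of noncausal SNPs needed so that len(causal) / total == proportion
    n_fill = round(len(causal) * (1.0 - proportion) / proportion)
    # assumption: if the gene has fewer noncausal SNPs than needed, use them all
    n_fill = min(n_fill, len(noncausal))
    return list(causal) + rng.sample(list(noncausal), n_fill)

# A gene with 5 causal and 100 noncausal SNPs (made-up identifiers).
toy_causal = ["c%d" % k for k in range(5)]
toy_noncausal = ["n%d" % k for k in range(100)]
toy_sets = {p: dilute_signal(toy_causal, toy_noncausal, p) for p in (0.10, 0.25, 0.50)}
```

Keeping the causal SNPs fixed and varying only the number of noncausal SNPs means that differences in power across the three proportions reflect dilution rather than a change in which causal effects are tested.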

Power comparisons 

For the analysis without adjusting for the proportion of causal 
variants in the gene, we used the generalized estimating 
equation (GEE) [12] method to test for differences in power 
between scenarios, accounting for the correlations induced by 
analyzing the same gene 12 times. Specifically, of the 200 
simulations, let $Y_{ij}$ denote the number of rejections of the 
null hypothesis for the $j$th test of the $i$th gene, and $p_{ij}$ 
the power of that test, $i = 1, 2, \ldots, 31$, $j = 1, 2, \ldots, 12$. 
We treated the $Y_{ij}$ as correlated measures for the $i$th gene, and 
then we constructed a binomial regression model using the 
GEE method to compare the power of the tests: 

$$Y_{ij} \sim \mathrm{Binomial}(200, p_{ij})$$

$$\mathrm{logit}(p_{ij}) = \beta_0 + \beta_1 I(\mathrm{SBP}_{ij}) + \beta_2 I(\mathrm{common}_{ij}) + \beta_3 I(\mathrm{common\ and\ rare}_{ij}) + \beta_4 I(\mathrm{SKAT}_{ij})$$

where $\beta_1$ represents the difference in power for using 
SBP rather than DBP as the outcome, $\beta_2$ represents the 
difference in power for using common variants instead 
of rare variants, $\beta_3$ represents the difference in power 
for jointly using common and rare variants compared 
with using rare variants only, and $\beta_4$ represents the 
difference in power for using the family-based SKAT 
rather than the family-based burden test; all differences are 
on the logit scale. Here $I(A)$ 
denotes the indicator function, which equals 1 when $A$ 
is true and 0 otherwise. Additionally, these effects were 
evaluated in a similar model adjusting for the proportion 
of causal variants in the gene, where $j = 1, 2, \ldots, 36$. 
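
To make the coding of the indicator covariates concrete, the sketch below enumerates the 12 scenarios (2 traits, 3 variant sets, 2 tests) and builds the indicators that enter the logit model, with DBP, rare variants only, and the burden test as the reference levels. It covers only the covariate coding and the empirical power; fitting the GEE itself would need a statistics package and is not shown. All names are illustrative.

```python
from itertools import product

TRAITS = ("DBP", "SBP")
VARIANT_SETS = ("rare", "common", "common_and_rare")
TESTS = ("burden", "SKAT")

def scenario_rows():
    """One row of indicator covariates per scenario (reference: DBP, rare, burden)."""
    rows = []
    for trait, variants, test in product(TRAITS, VARIANT_SETS, TESTS):
        rows.append({
            "trait": trait, "variants": variants, "test": test,
            "I_SBP": int(trait == "SBP"),
            "I_common": int(variants == "common"),
            "I_common_and_rare": int(variants == "common_and_rare"),
            "I_SKAT": int(test == "SKAT"),
        })
    return rows

def empirical_power(n_rejections, n_replicates=200):
    """Observed proportion of replicates in which the null hypothesis was rejected."""
    return n_rejections / n_replicates

rows = scenario_rows()
```

Each gene contributes one such block of 12 rows, and the within-gene correlation of the 12 counts is what the GEE working correlation is meant to absorb.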

Results 

Trio and causal SNPs 

Using the approach stated in the methods section, we 
extracted a total of 93 trios from the GAW18 data. Our 
analysis focuses on chromosome 3 only. With knowl- 
edge of the simulating model, the 31 causal genes were 
available for the family-based SKAT and the burden test 
of all SNPs. When examining different power to detect 
the association under different proportions of causal 
SNPs, only the 16 causal genes that contain at least 1 
causal rare variant and at least 1 causal common variant 
were included. 

Table 1 Scenarios of the 12 tests performed in comparing family-based SKAT and burden test using different types of 
variants (i.e., common vs. rare) and different types of outcome (i.e., DBP vs. SBP) 

Gene-based test of all SNPs 

We applied the family-based versions of the burden and 
SKAT tests to the 93 trios for the gene-based association 
test of the 31 causal genes, using the 200 simulated 
data sets. Variable Q1 (a quantitative trait in the simulation 
data set that is not associated with any of the 
SNPs) was used to assess type I error. The empirical 
type I error rates are close to the nominal level of 0.05, 
ranging from 0.043 to 0.059, which is within the 
95% confidence interval of the nominal level, that is, 
(0.035, 0.065). An earlier study [7] also reported that 
false-positive rates of the methods we applied were well 
controlled in large simulations with considerable 
sample size. Consequently, we used 0.05 as the significance 
level when calculating power. 
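
An interval of this kind can be obtained from the normal approximation to a binomial proportion. The sketch below is illustrative only: the text does not state how many tests entered the calculation, and the default `n_tests = 800` is an assumption chosen because it gives limits close to the quoted (0.035, 0.065).

```python
import math

def nominal_interval(alpha=0.05, n_tests=800, z=1.96):
    """Normal-approximation 95% interval for a rejection rate with true level alpha.

    n_tests is an assumed count, not a value taken from the text.
    """
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_tests)
    return alpha - half_width, alpha + half_width

lower, upper = nominal_interval()
```

The interval narrows with the square root of the number of tests, so the width depends directly on the count that is assumed here.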

Figure 1 shows the power of correctly identifying causal 
genes at the α = 0.05 level. Plots in the left-side 
panels show patterns similar to those in the right-side 
panels, which is not surprising considering that SBP and 
DBP are highly correlated phenotypes. According to the 
simulating model, gene MAP4 has a strong signal. Our 
results show that both the family-based SKAT and the 
burden test are able to detect MAP4. 

In Figure 1C to F, we observed 2 peaks in the plots, 
which correspond to genes PROK2 and SERP1. For 
both genes, the peaks were observed only when common 
variants were included in the test. This finding is 
consistent with the underlying simulating model, in 
which almost all the causal SNPs in these 2 genes are 
common variants. However, the family-based SKAT did 
not show any power beyond type I error to identify 
these 2 genes. 

Figure 1 Power of tests using all available SNPs in the data (α = 0.05). Plots in the left panels use SBP as a continuous outcome. Plots in 
the right panels use DBP as a continuous outcome. All 6 plots use the same legend (SKAT, Burden). Plots in the first row use SNPs with MAF <0.05 
(rare variants only). Plots in the second row use SNPs with MAF >0.05 (common variants only). Plots in the third row use both. 

Testing under different proportions of causal SNPs 

We conducted both the family-based SKAT and the 
burden test in scenarios containing different proportions 
(i.e., 10%, 25%, 50%) of causal SNPs in the analysis. 
Figure 2 shows the results. 

Figure 2 Power of tests under different proportions of causal SNPs (α = 0.05). Plots in the left panels use SBP as a continuous outcome. 
Plots in the right panels use DBP as a continuous outcome. All 6 plots use the same legend. Plots in the first row use SNPs with MAF <0.05. 
Plots in the second row use SNPs with MAF >0.05. Plots in the third row use both. 

We did not observe a substantial impact of the percentage 
of causal SNPs on the power of either test from the 
plots, whether or not the analysis included rare variants. 
The power of the family-based SKAT and that of the 
burden test are comparable across most genes. Two 
prominent exceptions are the genes MAP4 and MLH1. In 
Figure 2A, the family-based SKAT has much higher 
power than the burden test when using rare variants 
only in gene MAP4. A possible explanation is that the 
causal rare variants in MAP4 affect SBP in different 
directions (as confirmed by the simulating model); 
the burden test therefore loses considerable power because 
the β coefficients of the causal variants have mixed signs. 
When we included common variants in the analysis, the 
power of the family-based SKAT decreased. This is a 
result of very few common SNPs being causal in MAP4; 
therefore, adding common variants increases the number 
of noncausal SNPs and dilutes the causal signals. 
The burden test, however, had slightly higher power 
than the family-based SKAT in gene MLH1. Examination 
of the simulating model suggests that almost all 
variants in MLH1 affect DBP and SBP in the same 
direction, so it is not surprising that the burden test 
had comparable or better performance than the 
family-based SKAT for testing the variants in MLH1. 

Another interesting peak was observed at gene PTPLBl 
(Figure 2E and 2F) when using 50% causal SNPs and 
both common and rare variants. Both the family-based 
SKAT and the burden test showed higher power than 
other scenarios. The power is lower for testing either rare 
variants only or common variants only, which suggests 
that combining rare variants and common variants 
together may increase the power of both tests. 

Power comparisons 

In addition to visually comparing power as presented in 
Figures 1 and 2, we used GEE methods to more rigor- 
ously compare the power under different scenarios. Spe- 
cifically, we did not detect significant differences 
(p value >0.3) in power between the family-based SKAT 
and the burden test across all scenarios, whether we 
adjusted for the proportion of causal variants or not. 
However, after adjusting for proportions of causal var- 
iants, we found that on average, the tests using common 
variants only had less power compared to those using 
rare variants only, followed by the tests using both com- 
mon and rare variants. The test for overall difference in 
power yields a p value of 0.04. 

Discussion 

Our analysis using the GAW18 simulated baseline phenotypes 
and sequence genotypes, with a sample size of 93 
conditionally independent trios, shows limited power of both 
the family-based SKAT and the burden test. The low 
power is most likely the result of the small number of 
trios and the weak signals in the simulating model. However, 
we found that both tests adequately controlled the 
type I error rates with only 93 trios. This agrees with the 
results of the simulation studies in [7], where a large number 
of trios was considered (n = 500). Furthermore, after 
adjusting for the proportion of causal variants, we found 
significant differences in power between tests using common 
variants only versus tests using rare variants only or both 
common and rare variants. A larger number of trios is 
needed to confirm this finding, as suggested by [13,14]. 

Recently, 2 methods using SKAT for family data have 
been proposed [15,16]. Both of these methods take into 
account the whole family structure by using a marginal 
model with correlation structure specified by kinship 
matrix. However, there is a subtle difference between 
these 2 methods and our method. These 2 methods are 
comparing allele frequencies as a population-based test 
using the mixed-modeling framework to take into 
account the correlation among the individuals within a 
family, whereas our method is a transmission disequili- 
brium type (TDT) test, which is conditioned on parental 
genotypes and compares allelic transmissions. In the 
absence of population structure, the population-based 
association tests using the whole family are expected to 
be more powerful than our method. However, in the 
presence of population structure, the former tests may 
lead to inflated type I errors whereas our method is 
robust to population structure. Hispanic populations, 
such as the one used in this study, are likely to be 
admixed [17] and, therefore, the TDT-based method 
remains robust to potential population structure. 